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In this paper we discuss a novel data compression technique for binary symmetric sources based 
on the cavity method over GF(g), the Galois Field of order q. We present a scheme of low complexity 
and near optimal empirical performance. 

The compression step is based on a reduction of a sparse low density parity check codes over 
GF(g) and is done through the so called reinforced belief-propagation equations. These reduced 
codes appear to have a non-trivial geometrical modification of the space of codewords which makes 
, such compression computationally feasible. The computational complexity is 0(d ■ n ■ q ■ log 2 q) per 

■ iteration, where d is the average degree of the check nodes and n is the number of bits. 

For our code ensemble, decompression can be done in a time linear in the code's length by a 
' simple leaf-removal algorithm. 
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I. INTRODUCTION 



The relation between information theory and statistical mechanics of disordered systems has been long stablished 
1, 2\. Since then, various techniques from statistical physics of disordered systems have been used not only to assess 
the theoretical bounds of the achievable performance but also to provide practical encoding/decoding methods for 
lossy data compression. In particular, both cavity method and replica symmetry breaking techniques have been used 
to demonstrate the Shannon results and assess the performance of codes defined on sparse factor graphs 0-0. 
■ In this paper we address the classical problem of finding an efficient lossy compression scheme for a generic binary 
£C} , symmetric source. This objective is reached by exploiting some unexpected features of the cavity method when applied 
CS| ■ to graphical codes denned over a finite field algebra of high order. 

^^O Given any realization y £ {0, 1}™ of a symmetric Bernoulli process Y, the goal is to compress y by mapping it to a 

shorter binary vector such that an approximate reconstruction of y is possible within a given fidelity criterion. More 
precisely, suppose y is mapped to the binary vector x 6 {0, l} fc with k < n and y is the reconstructed source sequence. 
The quantity R = k/n is called the compression rate. The fidelity or distortion is measured by the Hamming distance 
<fe(y ; y) = \Vi ~ Vi\- The goal is to minimize the average Hamming distortion D = E[dir(Y,Y)] for 

J> . any given rate. The asymptotic limit, known as the rate-distortion function, is given by R(D) = 1 — H(D) for any 
D E [0, 0.5] where H(D) = — £>log 2 D — (1 — D) log 2 (l — D) is the binary entropy function. 

Our approach in this paper is based on Low-Density Parity-Check (LDPC) codes. Let C be a LDPC code with 
> k x n generator matrix G and m x n parity check matrix H. Encoding in lossy compression can be implemented 
like decoding in error correction. Given a source sequence y, we look for a codeword y £ C such that djj (y,y) is 
minimized. The compressed sequence x is obtained as the k information bits that satisfy y = G T x. 

Even though LDPC codes have been successfully used for various types of lossless data compression schemes [7|, 
and also the existence of ensembles that asymptotically achieve the Shannon's bound for binary symmetric sources 
has been proved 0, they have not been fully explored for lossy data compression. It is partially due to the long 
standing problem of finding a practical source-coding algorithm for LDPC codes, and partially because Low-Density 
Generator Matrix (LDGM) codes, as dual of LDPC codes, seemed to be more adapted for source coding and received 
more attention in the few past years. 

In Q, Martinian and Yedidia show that quantizing a ternary memoryless source coding with erasures is dual 
of the transmission problem over a binary erasure channel. They also prove that LDGM codes, as dual of LDPC 
codes, combined with a modified Belief Propagation (BP) algorithm can saturate the corresponding rate-distortion 
bound. Following their pioneering work, LDGM codes have been extensively studied for lossy compression by several 
researchers [1CH15| . In a series of parallel works, several researches have used techniques from statistical physics to 
provide non-rigorous analysis of LDGM codes [1, [H, [l6j] . However, LDGM codes seem to perform well only for rates 
smaller than 0.5. As we will see, our proposed LDPC codes perform very near to the rate distortion bound for rates 
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larger than 0.5. For smaller rates the loss in performance can be compensated by increasing the complexity (number 
of iterations) of our coding scheme. 

In terms of practical algorithms, lossy compression is still an active research topic. In particular, an asymptotically 
optimal low complexity compressor with near optimal empirical performance has not been found yet. Almost all 
suggested algorithms have been based on some kind of decimation of belief/survey propagation which suffers a com- 
putational complexity of 0{n 2 ) flo| . fTTj | and [HI]. One exception is the algorithm proposed by Murayama When 
the generator matrix is ultra sparse (US), the algorithm was empirically shown to perform very near to the associ- 
ated capacity needing 0(n) computations. A generalized form of this algorithm, called Reinforced Belief Propagation 
(RBP)jl7j , was used in a dual setting [ijj], for ultra sparse LDPC codes (US-LDPC) over GF(2) for lossy compression 
20]. The main drawback in both cases is the non-optimality of ultra sparse structures over GF(2) [5. 10, 12], As we 
will see, this problem can be overcome by increasing the size of the hnite held. 

Estimation of the weight enumerating function show that randomly constructed US-LDPC codes over GF((j) nearly 
achieve the rate-distortion bound for q > 64. Despite this, practical encoding for these codes is a hard task. The 
main problem seems to stem from geometrical properties of the configuration space: as the codes are good for channel 
coding, solutions are isolated and well-separated. This characteristic is known to make the encoding problem difficult 
to solve for iterative and local algorithms [2l|, |22j. To improve this step we introduce the ensemble of b-reduced US- 
LDPC codes, which by eliminating a logarithmic number of constraints from US-LDPC codes, just multiplies the 
number of codewords by a polynomial. This change has a negligible effect in the rate, while having a large effect on 
the performance of the scheme. Indeed, this modification not only improves the convergence of the RBP algorithm 
on encoding, but also provides us with a simple efficient decoding algorithm. 

The rest of this paper is organized as follows. Section [TT1 reviews the code ensemble which we use for lossy compres- 
sion. Section [TTTI describes the RBP algorithm over GF(g). We also discuss briefly the complexity and implementation 
of the RBP algorithm. In section IIVI we describe iterative encoding and decoding for our ensemble and then present 
the corresponding simulation results in section [V] A brief discussion on further research is given in Section I VII 

II. LDPC CODES OVER GF(<?) 

In this section we introduce the ultra sparse LDPC codes over GF(q). As we will see later, near Shannon's bound 
lossy compression is possible using these codes and BP-like iterative algorithms. 

A. (A,p) Ensemble of GF(q) LDPC codes 

We follow the methods and notations in [23] to construct irregular bipartite factor graphs. What distinguishes 
GF(g) LDPC codes from their binary counterparts is that each edge (i, j) of the factor graph has a label h^j G GF(g) 
\{0}. In other words, the non-zero elements of the parity-check matrix of a GF((j) LDPC codes are chosen from the 
non-zero elements of the field GF(q). Denoting the set of variable nodes adjacent to a check node j by J\f(j), a word 
c with components in GF(g) is a codeword if at each check node j the equation ^iJ c i = holds. 

An ensemble of GF(g) LDPC codes is characterized by two generating polynomials \(x) — 

Yli=i ^iZ 1-1 an d p{x) — Yli=i Pi x% ~ X where \ (pi) denotes the fraction of edges incident on variable(check) nodes 
of degree i and d v (d c ) is maximum variable (check) node degree. 

A (A,p) GF(g) LDPC code can be constructed from a (A,p) LDPC code by random independent and identically 
distributed selection of the matrix coefficients with uniform probability from GF(q)\{0}. Note that this may not be 
an optimal way for selecting the coefficients. For more details on code construction and coefficient selection we refer 
the readers to ^ and [jj]. 

B. Code Construction for Lossy Compression 

It is well known that the parity check matrix of a GF(g) LDPC code, optimized for binary input channels, is much 
sparser than the one of a binary LDPC code with same parameters [24j,|26j. In particular, when q > 2 6 , the best error 
rate results on binary input channels is obtained with the lowest possible variable node degrees, i.e., when almost all 
variable nodes have degree two. Such codes have been called ultra sparse or cyclic LDPC codes in the literature. In 
the rest of this paper we call a LDPC code ultra sparse (US-LDPC) if all variable nodes have degree two. We will 
mainly concentrate on codes in which the parity check's degree distribution is concentrated on at most two different 
degree values, for any given rate. 
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Given a linear code C and an integer b, a 6-reduction of C is the code obtained by randomly eliminating b parity- 
check nodes of C. For reasons to be cleared in section ITVl we are mainly interested in ^-reduction of GF(q) US-LDPC 
codes for small values of b (1 < b < 5). 

GF((j) US-LDPC codes have been extensively studied for transmission over noisy channels [27| . [28] . [26j . The 
advantage of using such codes is twofold. On the one hand, by moving to sufficiently large fields, it is possible to 
obtain nearly capacity achieving codes. On the other hand, the extreme sparseness of the factor graph is well-suited 
for iterative message-passing decoding algorithms. Despite the state of the art performance of moderate length GF(g) 
US-LDPC channel codes, they have been less studied for lossy compression. The main reason being the lack of fast 
suboptimal algorithms. In the next section we present RBP algorithm over GF(g) and then show that practical 
encoding for lossy compression is possible by using RBP as the encoding algorithm for the ensemble of 6-reduced 
US-LDPC codes. 



III. REINFORCED BELIEF PROPAGATION ALGORITHM IN GF(g) 

In this section first we briefly review the RBP equations over GF(g) and then we discuss in some details the 
complexity of the algorithm following Declercq and Fossorier (28j . 

A. BP and RBP Equations 

The GF(q) Belief Propagation (BP) algorithm is a straightforward generalization of the binary case, where the 
messages are q-dimensional vectors. 

Let fi^* denotes the message vector form variable node v to check node / at the £th iteration. For each symbol 

a GGF(g), the ath component of /i^j is the probability that variable v takes the value a and is denoted by /i^(a). 
Similarly, [i^ v denotes the message vector from check node / to variable node v at the iteration I and (a) is its ath 
component. Also let N(v) (Ai(f)) denote the set of check (variable) nodes adjacent to v (/) in a given factor graph. 

Constants ^ are initialized according to the prior information. The BP updating rules can be expressed as follows: 

Local Function to Variable: 

fJ-j v (a)oc JT i4,f(a) (1) 

Conf ( , i/) (o) v'£M(f)\{v} 

Variable to Local Function: 

Hif(a) cx fil(a) TJ 4» (2) 
f'eM(v)\{f} 

where Conf(„j)(a) is the set of all configurations of variables in Ai(f) which satisfy the check node / when the 
value of variable v is fixed to a. We define the marginal function of variable v at iteration £ + 1 as 

g^Woc/^a) TJ n%{a). (3) 

The algorithm converges after t iterations if and only if for all variables v and all function nodes / 

= 

up to some precision e. A predefined maximum number of iterations ^ max and the precision parameter e are the input 
to the algorithm. 

RBP is a generalization of BP in which the messages from variable nodes to check nodes are modified as follows 

t4fHo) <* (g'») 7( V(a) J] /4», (4) 

f>eAf(v)\{f} 

where : [0, 1] — > [0, 1] is a non-decreasing function and gfj is the marginal function of variable v at iteration £. 
The marginals for RBP are defined as 
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gi +1 (a) cx (gS(a)) 7W A*i(o) I] 4»- ( 5 ) 

/6JV(«) 

Intuitively, RBP equations can be thought as a sort of "soft-decimation" procedure. Indeed, in a decimation 
procedure [18[ , the BP equations are iterated until convergence, and then an infinite external field with the same sign 
of the local field is applied to one or more variables and the process is repeated (until all variables receive an infinite 
field). In the RBP procedure, every variable receives a finite external field which is proportional to its own local field 
(the proportionality factor being j(£)). Moreover, the two time-scales (convergence and external field update) are 
intermixed. 

It is convenient to define 7 to be 

1(1) = l-707i, (6) 

where 70,71 are in [0, 1]. 

Note that when 71 = 1, RBP is the same as the algorithm presented in Q for lossy data compression. In this case 
it is easy to show that the only fixed points of RBP arc configurations that satisfy all the constraints. 

B. Efficient Implementation 

Ignoring the normalization factor in ([2]l. to compute all variable to check- node messages at a variable node of 
degree d v we need 0(q ■ d v ) computations. A naive implementation of GF(q) BP has computational complexity of 
0(d 2 ■ q 2 ) operations at each check node of degree df. This high complexity is mainly due to the sum in ([T]), that 
can be interpreted as a discrete convolution of probability density functions. Efficient implementations of function 
to variable node messages based on Discrete Fourier Transform (DFT) have been proposed by several authors, see 
for example (24l. [28l - [30j and the references within. The procedure consists in using the identity Q v i e M(f)\{v} = 
J 7 [Ylv' eM(f)\{v} -^"(/V/) ) where the symbol denotes convolution of functions over GF(q), and the product on 
the right-hand side is the pointwise product of real-valued functions. 

Assuming q — 2 P , the Fourier transform of each message fjb v it needs 0(q ■ p) computations and hence the total 
computational complexity at check node / can be reduced into 0(d 2 f ■ q ■ p). This complexity can be further reduced 
to 0(df ■ q -p) by using the fact that IL'eM(/)\W ^ (tbif) = Ylv'eM(f) ^(Pv'f) /^(/Vf), or alternatively by using 
the summation strategy described in [3lj which has the same complexity but is numerically more stable. Therefore, 
the total number of computations per iteration is Old ■ q-p-n) where d is the average check-node degree. A prototipe 
CH — h implementation of these equations is provided in source form [35]. 

IV. ITERATIVE LOSSY COMPRESSION 

In the following three subsections we first describe a simple method for identifying information bits of a 6-reduced 
US-LDPC code and then present a near optimal scheme for iterative compression (encoding) and linear decompression 
(decoding) . 

A. Identifying a Set of Information Bits 

For 6-reduced US-LDPC codes, one can use the leaf removal (LR) algorithm to find the information bits in linear 
time. In the rest of this section we briefly review the LR algorithm and show that f-reduction (removal of a sole check 
node) of a US-LDPC code significantly changes the intrinsic structure of the factor graph of the original code. 

The main idea behind LR algorithm is that a variable on a leaf of a factor graph can be fixed in such a way that 
the check node to which it is connected is satisfied [32| . Given a factor graph, LR starts from a leaf and removes it 
as well as the check node it is connected to. LR continues this process until no leaf remains. The residual sub-graph 
is called the core. Note that the core is independent of the order in which leaves (and hence the corresponding check 
nodes) are removed from the factor graph. This implies that also the number of steps needed to find the core does 
not depend on the order on which leaves are chosen. 

While US-LDPC codes have a complete core, i.e. there is no leaf in their factor graph, the 6-reduction of these 
codes have empty core. Our simulations also indicate that even I-reduction of a code largely improves the encoding 
under RBP algorithm (see section [Vj) . How RBP exploits this property is the subject of ongoing research. 
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As we have mentioned, the LR algorithm can be also used to find a set of information bits of a given US-LDPC 
code. Let us examine the LR algorithm in more detail. At any step t of LR algorithm, a leaf variable node v t attached 
to a factor node ft is selected. Denote by F t the remaining leaf variable nodes attached to check node ft (F t could 
be empty if there are no other leaves attached to it). Now we remove check node ft and leaf nodes in F t U and 
repeat. Under the hypothesis that the original graph was connected, this process is guarateed to finish at some time 
T with the empty graph, as at each step except the last one, at least one leaf is created. 

It is easy to see that at each step, if we fix the values of all variables except those in F t U {ft}, then for each con- 
figuration of variables in F tl the value of variable Vt is uniquely determined. Therefore, Ut=i...TFt = ■ • ■ , wjv-t} 
will form a set of information bits. Indeed, using the ordering of the variable indices vt , ... , «i, ii>i , w^-t, the check 
matrix becomes upper triangular and each solution can be found by back substitution in linear time once information 
bits are fixed. 



B. Iterative Encoding 

Suppose a code of rate R and a source sequence y is given. In order to find the codeword y that minimizes dn (y, y), 
we will employ the RBP algorithm with a strong prior nl(a) = exp(— Ldn (y v , «)) centered around y. The sequence 
of information bits of y is the compressed sequence and is denoted by x. In order to process the encoding in GF(q), 
we first need to map y into a sequence in GF(q). This can be simply done by grouping p bits together and use the 
binary representation of the symbols in GF(q). 

C. Linear Decoding 

Given the sequence of information bits x, the goal of the decoder is to find the corresponding codeword y. This can 
be done by calculating the G T x which in general needs 0{n?) computations. One of the advantages of our scheme is 
that it allows for a linear complexity iterative decoding. The decoding can be performed by iteratively fixing variables 
following the inverse steps of the LR algorithm; at each step t only one non-information bit is unknown (variable Vt) 
and its value can be determined from the parity check ft. For a sparse parity-check matrix, the number of needed 
operations is O(n). It is straightforward to show that a code has empty core if and only if there exists a permutation 
of columns of the corresponding parity-check matrix H such that hij ^ for i — j and hij = for all i > j. The 
decoding procedure is equivalent to back-substitution on this permutated triangular matrix. 

V. SIMULATION RESULTS 

A. Approximating the Weight Enumeration Function by BP 

Given an initial vector y, and a probability distribution P(c) over all configurations, the P-average distance from 
y can be computed by 

i Ci 

where P(ci) is the set of marginals of P. On the other hand, the entropy of the distribution P is defined by 

S(P) = -JP(e)logP(c). (8) 

c 

Even though it is a hard problem to calculate analytically both mar gina ls and S(P) of a given code, one may 
approximate them using messages of the BP algorithm at a fixed point [331 ] - Assuming the normalized distance is 
asymptotically a self-averaging quantity for our ensemble, S(P) represents the logarithm of the number of codeword 
at distance Dp(y) from y. By applying a prior distribution on codewords given by exp(— Ldn (c, y)) one is able to 
sample the sub-space of codewords at different distances from y. 

Figure [1] demonstrates the weight enumerator function (WEF) of random GF(q) US-LDPC codes for rates 0.3, 0.5, 
and 0.7 and field orders 2, 4, 16, 64 and 256. The blocklength is normalized so that it corresponds to n = 12000 
binary digits. 
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FIG. 1: (Color online) The approximate WEF of GF(g) US-LDPC codes as a function of q for a same blocklength in binary 
digits. 

Though BP is not exact over loopy graphs, we conjecture that the WEF calculated for US-LDPC codes is asymp- 
totically exact. This hypothesis can be corroborated by comparing the plot in figure Q] with the simulation results we 
obtained by using RBP algorithm (figure [3]) . 

B. Performance 

For the simplicity of the analysis, in all our simulations the parameter 71 of RBP algorithm is fixed to one and 
therefore the function 7 is constant. We also fix the maximum number of iterations into £ m ax = 300. If RBP does not 
converge after 300 iterations, we simply restart RBP with a new random scheduling. The maximum number of trials 
allowed in our simulations is T max — 5. The encoding performance depends on several parameters such as 70, L, the 
field order q, and the blocklength n. In the following we first fix n, q and L, in order to see how the performance 
changes as a function of 70 . 

1. Performance as a Function 0/70 

We will show that, with this choice of j(£) = 1 — 70 there is a trade off, controlled by 70, between three main 
aspects of the performance, namely: average distortion, average number of iterations and average number of trials. 
The simulations in this subsection are done for a 5-reduced GF(64) US-LDPC code with l eng th n = 1600 and rate 
R = 0.33. The factor graph is made by Progressive-Edge-Growth (PEG) construction [27[. The rate is chosen 
purposefully from a region where our scheme has the weakest performance. The Shannon's distortion bound for 
this rate is approximately 0.1754. Note that the non-monotonous behaviour of RBP as a function of 70 could be a 
result of two concurrent phenomena: for small 70 the reinforcement dynamics is too fast and may drive the system 
to non-codewords, for large 70 the reinforcement contribution is small and the system does not achieve polarization 
under the predefined iteration bound. In the latter case, better performance may be achieved with 71 < 1 or simply 
with a different choice of 7^). 

In figure [5] we plot the performance as a function of 70. For 70 = 0.92 we achieve a distortion of D = 0.1851 
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FIG. 2: (Color online) Performance as a function of 70 for a PEG graph with n=1600 and R=0.33. The averages are taken 
over 50 samples. (a) Average distortion as a function of 70. For 70 > 0.96 the RBP does not converge within 300 iterations. 
(b)The average number of iterations. (c)The average number of trials, (d) The average number of iterations needed for each 
trial. Note that even though average number of iterations show a steep increase as a function of 70, the average number of 
iterations needed per trial increases only linearly. 

needing only 83 iterations in average and without any need to restart RBP for 50 samples. By increasing 70 to 0.96, 
one can achieve an average distortion of 0.1815 which is only 0.15 dB away from the rate-distortion bound needing 
270 iterations in average. However, as it can be seen in figure (Hd), the average number of iterations needed per trial 
increases only linearly as a function of 70 . 

2. Performance as a function of R and q 

Figure [3] shows the distortion obtained by randomly generated 5-reduced GF(q) US-LDPC codes for q = 2, q = 16, 
q = 64 and q = 256. The block length is fixed to n = 12000 binary digits. For each given code with rate larger than 
or equal to 0.3, we choose 70 and L so that the average number of trials does not exceed 2 and the average number of 
iterations remains less than 300. The optimized values of 70 and L are found by simulations and are reported in table 
Ufor q = 256. Under these two conditions, we report distortion corresponding to best values of the two parameters 
averaged over 50 samples. For codes with rates smaller than 0.3, one needs to allow for larger number of iterations 
and trials. 

As the data in tabic [T] indicate, by increasing rate, both L and 1 — 70 increase. Larger values of L impose stronger 
prior values, indicating that the initialized message distribution is more centered around y. Note that in high rates, if L 
is not chosen large enough, the loss in performance is substantial. On the other hand, 70 regulates the reinforcement 
needed. Values very near to one for low rates indicates essentially the failure of reinforced strategy. This is not 
surprising, since in the absence of a codeword near y, forcing BP to find a solution is useless. 
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Rate 

FIG. 3: (Color online) The rate-distortion performance of GF(g) LDPC codes encoded with RBP algorithm for q = 2,16,64 
and 256. The blocklength is 12000 binary digits and each point is the average distortion over 50 samples. 



TABLE I: The optimal values for L and 70 obtained experimentally for q = 256. 



Rate 


0.1 


0.2 


0.3 


0.4 


0.5 


0.6 


0.7 


0.8 


0.9 


L 


1.1 


1.3 


1.5 


1.7 


1.9 


2.3 


2.4 


2.8 


3.8 


70 


0.98 


0.96 


0.94 


0.92 


0.92 


0.90 


0.90 


0.88 


0.88 



3. Reduction Effect on Performance of US-LDPC Codes 

As we have mentioned, 5-reduced LDPC codes have been used in our simulations. The reduction improves both 
the convergence of RBP algorithm and the performance of the our scheme. In figure [4] we show how the performance 
changes as a function of b. The simulations in this subsection are done for a GF(64) US-LDPC code of length n = 1600 
and rate R = 0.33 with PEG constructed factor graph. 

VI. DISCUSSION ON REDUCED FACTOR GRAPHS 

Our results indicate that the scheme proposed in this paper outperforms the existing methods for lossy compression 
by low-density structures in both performance and complexity. The main open problem is to understand and analyze 
the behaviour of RBP over 6-reduced US-LDPC codes. 

We would like to add a few words to the role of 6-reduction. For simplicity, let us concentrate on q = 2, though 
the argument is general. First note that by removing a parity check node from a code, the number of codewords is 
doubled. This increment has an asymptotically negligible effect on the compression rate since it only increases by 1/n, 
while the robustness may increase. More generally, it is possible to significantly alter the geometry of the solution 
space while maintaining (asymptotically) the compression rate: for instance, adding a path {c = di, d.2, ■ • • , d& = 0} 
of new codewords from each codeword c of a given code to the codeword 0, such that dn(d.t, d t+ i) = 1/n and 
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FIG. 4: (Color online) Performance as a function of 6 for a PEG graph with n=1600 and R=0.33 over GF(64). The averages 
are taken over 50 samples. 

k < n, multiplies the number of codewords by at most n and thus increases the rate by at most log n/n which is 
asymptotically negligible. On the other hand, the codeword space becomes "star-shaped" and thus connected on the 
hypercube geometry. Note that such modified codes may be terrible for channel coding, as the separation properties 
may have been severely worsened (e.g. the minimum distance of the code becomes 1). 

We think that a similar phenomenon could take place on 6-reduced codes. On the one hand, the asymptotic rate 
for source coding under the proposed scheme is only increased by b/n and the performance assuming MAP encoding 
can only improve. On the other hand, we believe that the implied modification of the geometry could ease the task 
of our iterative encoder. Indeed, it is well known that large separation between solutions makes the problem very 
hard for iterative and local algorithms [U H3] ■ In the following we briefly explain some asymptotic implications of 
1-reduction on weight enumerating function of the US-LDPC code ensemble. 

4. 1 -Reduced US-LDPC codes 

As we have mentioned, cancelling a single check node increases the the cardinality of the code by the factor q. As 
we will see, for each codeword c of the original US-LDPC, there are created q — 1 new codewords which all have a 
distance O(logn) from c. In other words, a cluster of new codewords emerges for each codeword c. In order to see 
this fact, let v and v' denote two variables of degree one after removing the parity check a. With a probability which 
approaches one, the checknode a and both variables v and v' belong to a loop of length O(log n) of the original factor 
graph as n — > 00. After removing a, this loop is broken and for any codeword of the original factor graph one can 
obtain new codewords by assigning to v any value from the finite field and changing accordingly the values of all 
variables in the broken loop. Note that this can be done because all variables in the broken loop have degree two and 
v' can be adjusted to satisfy the last checknode in the path from v to v' . 

VII. CONCLUSIONS AND PERSPECTIVES 

Our main goal in this paper is to provide a low complexity coding scheme for lossy data compression with near 
rate-distortion bound performance. We propose a practical iterative encoding/decoding scheme that exploits the 
geometrical structure of the so called reduced ultra sparse low density parity check codes. Our proposed algorithm 
for encoding can be considered as a soft decimation strategy for belief propagation algorithm. The complexity per 
iteration at the iterative encoder depends linearly on both the length of the code and the order of the field on which 



10 



the code is defined. The decoding algorithm is based on leaf removal algorithm which has linear complexity on the 
proposed sparse factor graphs. 

We have investigated the behaviour of our scheme for various field orders and parameters of the proposed algorithm. 
In particular, we approximately calculate the weight enumerating function of US-LDPC codes as a function of field 
order using the BP algorithm. Our estimations show that US-LDPC codes over GF(q') nearly achieve the rate- 
distortion bound for q > 64. Though BP is not exact over loopy graphs, we conjecture that the WEF calculated for 
US-LDPC codes is asymptotically exact. This hypothesis is corroborated by the simulation results we obtained by 
using RBP algorithm. 

Our research can be expanded in several directions. For example, it is interesting to study other ultra sparse 
ensembles sharing similar properties, e.g. where just a certain fraction of variable nodes of degree one is allowed. 
Several directions could be explored in order to obtain more efficient coding schemes: other choices of the reinforcement 
rate "/(£), choices of random codes and coefficient selection, and a L — > oo version of the encoder along the lines of 



34] as it could allow much lower computational complexity. Work is in progress in these direction. 
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